We study time-bounded reachability in continuous-time Markov decision processes for time-abstract
scheduler classes. Such reachability problems play a paramount role in dependability analysis and 
the modelling of manufacturing and queueing systems. Consequently, their analysis has been studied 
intensively, and techniques for the approximation of optimal control are well understood. From a 
mathematical point of view, however, the question of approximation is secondary compared to the 
fundamental question whether or not optimal control exists. 

We demonstrate the existence of optimal schedulers for the time-abstract scheduler classes for all 
CTMDPs. Our proof is constructive: We show how to compute optimal time-abstract strategies with 
finite memory. It turns out that these optimal schedulers have an amazingly simple structure — they 
converge to an easy-to-compute memoryless scheduling policy after a finite number of steps. 

Finally, we show that our argument can easily be lifted to Markov games: We show that both 
players have a likewise simple optimal strategy in these more general structures. 

1 Introduction 

Markov decision processes (MDPs) are a framework that incorporates both nondeterministic and
probabilistic choices. They are used in a variety of applications such as the control of manufacturing
processes [12, ?] or queueing systems [16]. We study a real-time version of MDPs, continuous-time Markov
decision processes (CTMDPs), which are a natural formalism for modelling in scheduling [4, 12] and
stochastic control theory [?]. CTMDPs can also be seen as a unified framework for different stochastic
model types used in dependability analysis [15, 12, 9, 7, 10].

The analysis of CTMDPs usually concerns the different possibilities to resolve the nondeterminism 
by means of a scheduler (also called strategy). Typical questions cover qualitative as well as quantitative 
properties, such as: "Can the nondeterminism be resolved by a scheduler such that a predefined property 
holds?" or respectively "Which scheduler optimises a given objective function?". 

As a slight restriction, nondeterminism is either always hostile or always supportive in CTMDPs. 
Markov games [6] provide a generalisation of CTMDPs by disintegrating the control locations into 
locations where the nondeterminism is resolved angelically (supportive nondeterminism) and control 
locations where the nondeterminism is resolved demonically (hostile nondeterminism). 

In this paper, we study the maximal time-bounded reachability problem [12, 2, ?, 10, 11] in
CTMDPs and Markov games. Time-bounded reachability is the standard control problem of constructing
a scheduler that controls the Markov decision process such that the likelihood of reaching a goal region
within a given time bound is maximised, and of determining this probability. For games, both the angelic
and the demonic nondeterminism need to be resolved at the same time.

The obtainable quality of the resulting scheduling policy naturally depends on the power a scheduler
has to observe the run of the system and on its ability to store and process this information. The
commonly considered scheduler classes and their basic connections have been discussed in the
literature [10, 11]. Of these, we consider those schedulers that have no direct access to time, the
time-abstract schedulers. The time-abstract scheduler classes that can observe the history, its length,
or nothing at all, are marked H (for history-dependent), C (for hop-counting), and P (for positional),
respectively.

These classes form a simple inclusion hierarchy ($H \supseteq C \supseteq P$) and in general they yield
different maximum reachability probabilities. However, it is known that for uniform CTMDPs the maximum
reachability probabilities of classes H and C coincide [2]. Uniform CTMDPs have a uniform transition
rate $\lambda$ for all their actions.

Optimal schedulers. Given its practical importance, the bounded reachability problem for Markov
decision processes (and their deterministic counterpart, the Markov chains) has been intensively
studied [?].

While previous research focused on approximating optimal scheduling policies [2], the existence of
optimal schedulers for all scheduler classes has been demonstrated in Rabe's master's thesis [13, 14], on
which this paper is partly based. Meanwhile, Brazdil et al. [3] have independently provided a similar
result for uniform Markov games, that is, for games that use the same transition rate for all actions.

Contribution. We start with a report on our work on counting (C) and history-dependent (H) schedulers
in uniform CTMDPs. Although the case of the counting schedulers could by now be inferred as a
corollary from the existence of optimal counting strategies in Markov games [3], we decided to present
it for three reasons: Firstly, it requires only marginal extra effort. Secondly, CTMDPs have been an
important object of study for decades whereas Markov games are comparatively new, and we think that
our proof can provide insights, in particular to readers who are not familiar with games. Finally, it was
developed independently and at the same time.

We then show how our result on uniform CTMDPs can be lifted to general CTMDPs, and that
randomisation cannot improve the quality of optimal scheduling. In the section on Markov games, we show
that our lifting argument naturally extends to Markov games: there are optimal time-abstract counting and
history-dependent schedulers with finite memory for general Markov games and, as for CTMDPs,
randomisation cannot improve optimal scheduling for either player.

Our solution builds on the observation that, if time has almost run out, we can use a greedy strategy 
that optimises our chances to reach our goal in fewer steps rather than in more steps. We show that a 
memoryless greedy scheduler exists, and is indeed optimal after a certain step bound. The existence of 
an optimal scheduler is then implied by the finite number of remaining candidates — it suffices to search 
among those schedulers that deviate from the greedy strategy only in a finite preamble. 

The extension to non-uniform CTMDPs (and Markov games) builds upon a simple uniformisation 
technique and draws from a class of schedulers that are (partially) blind to the additional information 
introduced by the uniformisation. With the help of this scheduler class, we successively demonstrate that 
it is optimal (in the game case for both players) to turn to a fixed memoryless greedy strategy after a 
finite number of steps that is easy to compute. Hence, we can focus on scheduling policies that deviate 
from this scheduling policy only on a finite preamble. It then suffices to exclude that randomisation can 
improve the result (for either player) to reduce the candidate strategies to a finite set, and hence to infer 
the existence of simple optimal strategies for the non-uniform case as well. 
2 Continuous-Time Markov Decision Processes

A continuous-time Markov decision process $\mathcal{M}$ is a tuple $(L, Act, \mathbf{R}, \nu, B)$ with a
finite set of locations $L$, a finite set of actions $Act$, a rate matrix
$\mathbf{R} : (L \times Act \times L) \to \mathbb{Q}_{\geq 0}$, an initial distribution $\nu \in Dist(L)$,
and a goal region $B \subseteq L$. We define the total exit rate for a location $l$ and an action $a$ as
$\mathbf{R}(l,a,L) = \sum_{l' \in L} \mathbf{R}(l,a,l')$. For a CTMDP we require that, for all locations
$l \in L$, there must be an action $a \in Act$ such that $\mathbf{R}(l,a,L) > 0$, and we call such actions
enabled. We define $Act(l)$ to be the set of enabled actions in location $l$. If there is only one enabled
action per location, a CTMDP $\mathcal{M}$ is a continuous-time Markov chain [?]. If multiple actions are
available, we need to resolve the nondeterminism by means of a scheduler (also called strategy or
scheduling policy). As usual, we assume the goal region to be absorbing, and we use
$\mathbf{P}(l,a,l') = \frac{\mathbf{R}(l,a,l')}{\mathbf{R}(l,a,L)}$ to denote the time-abstract transition
probability.

Uniform CTMDPs. We call a CTMDP uniform with rate $\lambda$ if, for every location $l$ and action
$a \in Act(l)$, the total exit rate $\mathbf{R}(l,a,L)$ is $\lambda$. In this case the probability
$p_{\lambda t}(n)$ that there are exactly $n$ discrete events (transitions) in time $t$ is Poisson
distributed: $p_{\lambda t}(n) = e^{-\lambda t} \cdot \frac{(\lambda t)^n}{n!}$.
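
The Poisson step probabilities are used throughout the remainder of the paper, so a small numerical sketch may help. The function name `poisson_step_probability` and its parameters are assumptions made for illustration; they do not appear in the source. The value is evaluated in log space so that the factorial does not overflow for large $n$.

```python
import math


def poisson_step_probability(rate: float, t: float, n: int) -> float:
    """Probability of exactly n transitions within time t in a uniform CTMDP.

    `rate` plays the role of the uniform transition rate lambda.
    """
    mean = rate * t
    if mean == 0:
        # With no time (or no rate), only zero transitions are possible.
        return 1.0 if n == 0 else 0.0
    # exp(-mean) * mean**n / n!, computed via logarithms for numerical stability.
    return math.exp(-mean + n * math.log(mean) - math.lgamma(n + 1))
```

Summing the function over all $n$ should give a value close to 1, which is a quick sanity check for a chosen truncation depth.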

We define the uniformisation $\mathcal{U}$ of a CTMDP $\mathcal{M}$ as the uniform CTMDP obtained by the
following transformation steps. We create a copy $l_U$ for every $l \in L$ and obtain
$L_U = \bigcup_{l \in L} \{l, l_U\}$. We call the new copies unobservable, and all locations $l \in L$
observable. Let $\lambda$ be the maximal total exit rate in $\mathcal{M}$. The new rate matrix
$\mathbf{R}_U$ extends $\mathbf{R}$ by first adding the rate
$\mathbf{R}_U(l,a,l_U) = \lambda - \mathbf{R}(l,a,L)$ for every location $l \in L$ and action $a \in Act$
of $\mathcal{M}$, and by then copying the outgoing transitions from every observable location $l$
to its unobservable counterpart $l_U$, while the other components remain untouched. The intuition behind
this uniformisation technique is that it enables us to distinguish whether a step would have occurred in
the original automaton or not.
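
The two transformation steps can be sketched in a few lines. This is only an illustration under stated assumptions: the rate matrix is represented as a dictionary from `(location, action, successor)` to a positive rate, the unobservable copy of a location `l` is encoded as the pair `(l, "u")`, and the extra rate is only added for enabled actions (so that disabled actions stay disabled). None of these encodings come from the source.

```python
def uniformise(rates):
    """Return (uniform_rates, lam) for a CTMDP given as {(l, a, l2): rate}.

    Assumption: only enabled (location, action) pairs occur as keys.
    """
    # Total exit rate R(l, a, L) of every enabled pair.
    exit_rate = {}
    for (source, action, _target), rate in rates.items():
        exit_rate[(source, action)] = exit_rate.get((source, action), 0) + rate
    lam = max(exit_rate.values())  # maximal total exit rate

    # Step 1: add the missing rate as a transition to the unobservable copy.
    uniform = dict(rates)
    for (source, action), total in exit_rate.items():
        if total < lam:
            uniform[(source, action, (source, "u"))] = lam - total

    # Step 2: copy all outgoing transitions of l to its unobservable copy.
    for (source, action, target), rate in list(uniform.items()):
        uniform[((source, "u"), action, target)] = rate
    return uniform, lam
```

After the transformation, every enabled pair of location and action has total exit rate `lam`, both for observable locations and for their unobservable copies.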

Paths. A timed path in CTMDP $\mathcal{M}$ is a finite sequence in
$(L \times Act \times \mathbb{R}_{\geq 0})^* \times L = Paths(\mathcal{M})$. We write

\[ l_0 \xrightarrow{a_0,t_0} l_1 \xrightarrow{a_1,t_1} \cdots \xrightarrow{a_{n-1},t_{n-1}} l_n \]

for a sequence $\pi$, and we require $t_{i-1} < t_i$ for all $i < n$. The $t_i$ denote the system's time
when the events happen. The corresponding time-abstract path is defined as
$l_0 \xrightarrow{a_0} l_1 \xrightarrow{a_1} \cdots \xrightarrow{a_{n-1}} l_n$. We use
$Paths_{abs}(\mathcal{M})$ to denote the set of all such projections and $|\cdot|$ to count the number of
actions in a path. Concatenation of paths $\pi, \pi'$ will be written as $\pi \circ \pi'$ if the last
location of $\pi$ is the first location of $\pi'$.

Schedulers. The system's behaviour is not fully determined by the CTMDP; we additionally need a
scheduler that resolves the nondeterminism that occurs in locations where multiple actions are enabled.
When analysing properties of a CTMDP, such as the reachability probability, we usually quantify over a
class of schedulers. In this paper, we consider the following common scheduler classes, which differ in
their power to observe and distinguish events:

• Time-abstract history-dependent (H) schedulers $Paths_{abs}(\mathcal{M}) \to D$
  that map time-abstract paths to decisions.

• Time-abstract hop-counting (C) schedulers $L \times \mathbb{N} \to D$
  that map locations and the length of the path to decisions.

• Positional (P) or memoryless schedulers $L \to D$
  that map locations to decisions.
Decisions $D$ are either randomised (R), in which case $D = Dist(Act)$ is the set of distributions over
enabled actions, or are restricted to deterministic (D) choices, that is $D = Act$. Where it is necessary
to distinguish randomised and deterministic versions we will add a postfix to the scheduler class, for
example HD and HR. We restrict all scheduler classes to those schedulers creating a measurable probability
space (cf. [?]).
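
The three deterministic scheduler classes can be read as function types, which also makes the inclusion of P in C and of C in H concrete. The following is a minimal sketch; the type aliases, the path encoding as a pair of location and action sequences, and the two helper names are assumptions introduced here.

```python
from typing import Callable, Hashable, Sequence, Tuple

Location = Hashable
Action = Hashable
# Time-abstract path l0 -a0-> l1 ... -> ln, stored as (locations, actions).
AbstractPath = Tuple[Sequence[Location], Sequence[Action]]

PositionalD = Callable[[Location], Action]      # class PD: L -> Act
CountingD = Callable[[Location, int], Action]   # class CD: L x N -> Act
HistoryD = Callable[[AbstractPath], Action]     # class HD: Paths_abs -> Act


def positional_as_counting(scheduler: PositionalD) -> CountingD:
    """Every positional scheduler is a counting scheduler that ignores the count."""
    return lambda location, steps: scheduler(location)


def counting_as_history(scheduler: CountingD) -> HistoryD:
    """Every counting scheduler is a history-dependent scheduler that only
    looks at the last location and the number of actions of the path."""
    return lambda path: scheduler(path[0][-1], len(path[1]))
```

Randomised variants would return a distribution over enabled actions instead of a single action; they are omitted to keep the sketch short.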

Induced Probability Space. We build our probability space in the natural way: we first define the
probability measure for cylindric sets of paths that start with

\[ l_0 \xrightarrow{a_0,t_0} l_1 \xrightarrow{a_1,t_1} \cdots \xrightarrow{a_{n-1},t_{n-1}} l_n, \]

with $t_j \in I_j$ for all $j < n$, and for non-overlapping open intervals $I_0, I_1, \ldots, I_{n-1}$, to
be the usual probability that a path starts with these actions for a given randomised scheduler $S$, and
such that $S(l_0 \xrightarrow{a_0,t_0} \cdots \xrightarrow{a_{i-1},t_{i-1}} l_i)$ is equivalent for all
$(t_0, \ldots, t_{i-1}) \in I_0 \times \ldots \times I_{i-1}$:

\[ \int_{t_0 \in I_0,\, t_1 \in I_1,\, \ldots,\, t_{n-1} \in I_{n-1}} \; \prod_{i=0}^{n-1} \ldots \]

assuming $t_{-1} = 0$.

From this basic building block, we build our probability measure for measurable sets of paths and
measurable schedulers in the usual way (cf. [11]).

Time-Bounded Reachability Probability. For a given CTMDP $\mathcal{M} = (L, Act, \mathbf{R}, \nu, B)$ and
a given measurable scheduler $S$ that resolves the nondeterminism, we use the following notations for the
probabilities:

• $Pr_S^{\mathcal{M}}(l,t)$ is the probability of reaching the goal region $B$ in time $t$ when starting
  in location $l$,

• $Pr_S^{\mathcal{M}}(t) = \sum_{l \in L} \nu(l)\, Pr_S^{\mathcal{M}}(l,t)$ denotes the probability of
  reaching the goal region $B$ in time $t$,

• $Pr_S^{\mathcal{M}}(t;k)$ denotes the probability of reaching the goal region $B$ in time $t$ and in at
  most $k$ discrete steps, and

• $PR_S^{\mathcal{M}}(\pi,t)$ is the probability to traverse the time-abstract path $\pi$ within time $t$.

As usual, the supremum of the time-bounded reachability probability over a particular scheduler
class is called the time-bounded reachability of $\mathcal{M}$ for this scheduler class, and we use 'max'
instead of 'sup' to indicate that this value is taken for some optimal scheduler $S$ of this class.

Step Probability Vector. Given a scheduler $S$ and a location $l$ for a CTMDP $\mathcal{M}$, we define the
step probability vector $d_{l,S}$ of infinite dimension. An entry $d_{l,S}[i]$ for $i \geq 0$ denotes the
probability to reach goal region $B$ in up to $i$ steps from location $l$ (not considering any time
constraints).
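
For a positional deterministic scheduler, a finite prefix of the step probability vector can be computed by a simple forward iteration. The sketch below is an illustration only: the function name, the encoding of the time-abstract transition probabilities as `{(location, action): {successor: probability}}`, and the use of exact fractions are assumptions, not part of the source.

```python
from fractions import Fraction


def step_probability_vector(P, scheduler, goal, location, depth):
    """Return [d[0], ..., d[depth]] for `location` under a positional scheduler.

    P:         {(l, a): {l2: probability}} time-abstract transition probabilities
    scheduler: {l: a} positional deterministic scheduler
    goal:      set of goal locations B
    """
    locations = {l for (l, _a) in P}
    # d[l] holds the probability to reach B from l within the current step count.
    d = {l: Fraction(int(l in goal)) for l in locations}
    prefix = [d[location]]
    for _ in range(depth):
        d = {
            l: Fraction(1) if l in goal else sum(
                (p * d[l2] for l2, p in P[(l, scheduler[l])].items()),
                Fraction(0),
            )
            for l in locations
        }
        prefix.append(d[location])
    return prefix
```

For a location that moves to the goal with probability 1/2 and otherwise stays put, the prefix starts 0, 1/2, 3/4, 7/8.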

3 Optimal Time-Abstract Schedulers 

In this section, we show that optimal schedulers exist for all natural time-abstract classes, that is, for CD, 
CR, HD, and HR. Moreover, we show that there are optimal schedulers that become positional after a 
small number of steps, which we compute with a simple algorithm. We also show that randomisation 
does not yield any advantage: deterministic schedulers are as good as randomised ones. Our proofs
are constructive, and thus allow for the construction of optimal schedulers. This also provides the first 
procedure to precisely determine the time-bounded reachability probability, because we can now reduce 
this problem to solving the time-bounded reachability problem of continuous-time Markov chains [1].

Our proof consists of two parts. We first consider the class of uniform CTMDPs, which are much 
simpler to treat in the time-abstract case, because we can use Poisson distributions to describe the number 
of steps taken within a given time bound. For uniform CTMDPs it is already known that the supremum 
over the bounded reachability collapses for all time-abstract scheduler classes from CD to HR [2]. It
therefore suffices to show that there is a CD scheduler which takes this value. 

We then show that a similar claim holds for CD and HD schedulers in the general class of not
necessarily uniform CTMDPs. In this case, it also holds that there are simple optimal schedulers that
converge to a positional scheduler after a finite number of steps, and that randomisation does not improve
the time-bounded reachability probability. However, in the non-uniform case the time-abstract path
contains more information about the remaining time than its length only, and the bounded reachability of
history-dependent and counting schedulers usually differs (see [2] for a simple example).

We start this section with the introduction of greedy schedulers, HD schedulers that favour
reachability in a small number of steps over reachability with a larger number of steps; the positional
schedulers to which the CD and HD schedulers converge are such greedy schedulers.

3.1 Greedy Schedulers 

The objective we consider is to maximise time-bounded reachability $Pr_S^{\mathcal{M}}(l,t)$ for every
location $l$ with respect to a particular scheduler class such as HD. Unfortunately, this optimisation
problem is rather difficult to solve. Therefore, we start with analysing the special case of having little
time left (that is, the remaining time $t$ is close to 0).

Time-abstract schedulers have no direct access to the time, but they can infer the distribution over 
the remaining time from the time-abstract history (or its length). When examining the resulting Poisson 
distribution one can easily see that for large step numbers the probability to take more than one further 
step declines faster than the probability to take exactly one further step. Thus, any increase of the 
likelihood of reaching the goal region sooner dominates the potential impact of reaching it in further 
steps (after sufficiently many steps). 

This motivates the introduction of greedy schedulers. Schedulers are called greedy if they (greedily)
look for short-term gain, and favour it over any long-term effect. Greedy schedulers that optimise the
reachability within the first $k$ steps have been exploited in the efficient analysis of CTMDPs [2]. To
understand the principles of optimal control, however, a simpler form of greediness proves to be more
appropriate: We call an HD scheduler greedy if it maximises the step probability vector of every location
$l$ with respect to the lexicographic order (for example
$(0, 0.2, 0.3, \ldots) >_{lex} (0, 0.1, 0.4, \ldots)$). To prove the existence of greedy schedulers, we
draw from the fact that the supremum $d_l = \sup_{S \in HD} d_{l,S}$ obviously exists, where the supremum
is to be read as a supremum with respect to the lexicographic order. An action $a \in Act(l)$ is called
greedy for a location $l \notin B$ if it satisfies
$shift(d_l) = \sum_{l' \in L} \mathbf{P}(l,a,l')\, d_{l'}$, where $shift(d_l)$ shifts the vector by one
position (that is, $shift(d_l)[i] = d_l[i+1]$ $\forall i \in \mathbb{N}$). For locations $l$ in the
goal region $B$, all enabled actions $a \in Act(l)$ are greedy.

Lemma 3.1 Greedy schedulers exist, and they can be described as the class of schedulers that choose a 
greedy action upon every reachable time-abstract path. 

Proof It is plain that, for every non-goal location $l \notin B$,
$shift(d_l) \geq \sum_{l' \in L} \mathbf{P}(l,a,l')\, d_{l'}$ holds for every action $a$, and that
equality must hold for some.
For a scheduler $S$ that always chooses greedy actions, a simple inductive argument shows that
$d_l[i] = d_{l,S}[i]$ holds for all $i \in \mathbb{N}$, while it is easy to show that $d_l > d_{l,S}$ holds
if $S$ deviates from greedy decisions upon a path that is possible under its own scheduling policy and
does not contain a goal location. □

In particular, this allows us to fix a positional standard greedy scheduler by fixing an arbitrary greedy
action for every location.

To determine the set of greedy actions, let us consider a deterministic scheduler $S$ that starts in a
location $l$ with a non-greedy action $a$. Then
$shift(d_{l,S}) \leq \sum_{l' \in L} \mathbf{P}(l,a,l')\, d_{l'}$ holds true, where the sum
$\sum_{l' \in L} \mathbf{P}(l,a,l')\, d_{l'}$ corresponds to the scheduler choosing the non-greedy action
$a$ at location $l$ and acting greedy in all further steps. Let
$d_{l,a} = \sum_{l' \in L} \mathbf{P}(l,a,l')\, d_{l'}$ denote the step probability vector of such
schedulers.

We know that $d_{l,S} \leq d_{l,a} < d_l$. Hence, there is not only a difference between $d_{l,S}$ and
$d_l$; this difference will not occur at a higher index than the first difference between the newly
defined $d_{l,a}$ and $d_l$. The finite number of locations and actions thus implies the existence of a
bound $k$ on the occurrence of this first difference between $d_{l,a}$ and $d_l$ as well as $d_{l,S}$ and
$d_l$. While the existence of such a $k$ suffices to demonstrate the existence of optimal schedulers, we
show in Subsection 3.4 that this constant $k < |L|$ is smaller than the CTMDP itself.

Having established such a bound $k$, it suffices to compare schedulers up to this bound. This provides
us with the greedy actions, and also with the initial sequence
$d_{l,a}[0], d_{l,a}[1], \ldots, d_{l,a}[k]$ for all locations $l$ and actions $a$. Consequently, we can
determine a positive lower bound $\mu > 0$ for the first non-zero entry of the vectors $d_l - d_{l,S}$
(considering all non-greedy schedulers $S$). We call this lower bound $\mu$ the discriminator of the
CTMDP. Intuitively, the discriminator $\mu$ represents the minimal advantage of the greedy strategy over
non-greedy strategies.
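
Comparing candidate actions position by position, up to a fixed depth, can be sketched as an iterative pruning procedure. This is a hedged illustration rather than the authors' algorithm: the function name, the nested dictionary encoding of $\mathbf{P}$, and the choice of the depth (the caller is assumed to pass a bound such as the number of locations) are assumptions. Exact fractions are used so that equality tests between entries are reliable.

```python
from fractions import Fraction


def greedy_actions(P, goal, depth):
    """Lexicographic pruning of actions.

    P:    {l: {a: {l2: Fraction}}} time-abstract transition probabilities
    goal: set of goal locations B
    Returns (candidates, d) where candidates[l] is the set of actions that are
    lexicographically optimal up to `depth`, and d[l] = [d_l[0], ..., d_l[depth]].
    """
    d = {l: [Fraction(int(l in goal))] for l in P}
    candidates = {l: set(P[l]) for l in P}
    for i in range(depth):
        new_entry = {}
        for l in P:
            if l in goal:
                new_entry[l] = Fraction(1)  # all actions stay greedy in B
                continue
            # Entry i of sum_{l2} P(l, a, l2) * d_{l2} for each surviving action.
            value = {
                a: sum((p * d[l2][i] for l2, p in P[l][a].items()), Fraction(0))
                for a in candidates[l]
            }
            best = max(value.values())
            # Keep only actions that are still maximal at this position.
            candidates[l] = {a for a, v in value.items() if v == best}
            new_entry[l] = best
        for l in P:
            d[l].append(new_entry[l])
    return candidates, d
```

In a location where one action reaches the goal with probability 1/2 in one step and another reaches it with probability 1 in two steps, the first action survives the pruning, which matches the lexicographic reading of greediness.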

3.2 Uniform CTMDPs 

In this subsection, we show that every CD or HD scheduler for a uniform CTMDP can be transformed 
into a scheduler that converges to this standard greedy scheduler. 

In the quest for an optimal scheduler, it is useful to consider the fact that the maximal reachability 
probability can be computed using the step probability vector, because the likelihood that a particular 
number of steps happen in time t is independent of the scheduler: 

\[ Pr_S^{\mathcal{M}}(t) = \sum_{l \in L} \nu(l) \sum_{i=0}^{\infty} d_{l,S}[i] \cdot p_{\lambda t}(i). \tag{1} \]
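
Equation (1) can be evaluated numerically by truncating the infinite sum. The sketch below handles a single start location and returns a lower and an upper bound; the function name and the bounding scheme are assumptions made for illustration. The lower bound uses the fact that step probability vectors are non-decreasing, and the upper bound uses that entries never exceed 1.

```python
import math


def reachability_bounds(d, rate, t):
    """Bounds on sum_i d[i] * p(i) from a finite prefix d = [d[0], ..., d[k]].

    `rate` is the uniform rate lambda; rate * t is assumed to be positive.
    """
    mean = rate * t
    total = 0.0
    tail = 1.0  # Poisson mass not yet covered by the prefix
    for i, d_i in enumerate(d):
        p_i = math.exp(-mean + i * math.log(mean) - math.lgamma(i + 1))
        total += d_i * p_i
        tail -= p_i
    tail = max(tail, 0.0)  # guard against rounding below zero
    # Beyond the prefix, every entry lies between d[-1] and 1.
    return total + d[-1] * tail, total + tail
```

For an initial distribution over several locations, the bounds would be combined as the $\nu$-weighted sum over the locations.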

Moreover, the Poisson distribution has the useful property that the probability of taking $k$ steps is
falling very fast. We define the greed bound $n_{\mathcal{M}}$ to be a natural number, for which

\[ \mu \cdot p_{\lambda t}(n) \geq \sum_{i=1}^{\infty} p_{\lambda t}(n+i) \quad \forall n \geq n_{\mathcal{M}} \tag{2} \]

holds true. It suffices to choose $n_{\mathcal{M}} \geq \frac{2\lambda t}{\mu}$, since it implies
$\mu \cdot p_{\lambda t}(n) \geq 2\, p_{\lambda t}(n+1)$, $\forall n \geq n_{\mathcal{M}}$ (which yields
(2) by simple induction). Such a greed bound implies that the decrease in likelihood of reaching the goal
region in few steps caused by making a non-greedy decision after the greed bound dwarfs any potential
later gain. We use this observation to improve any given CD or HD scheduler $S$ that makes a non-greedy
decision after $\geq n_{\mathcal{M}}$ steps by replacing the behaviour after this history by a greedy
scheduler. Finally, we use the interchangeability of greedy schedulers to introduce a scheduler $\bar{S}$
that makes the same decisions as $S$ on short histories and follows the standard greedy scheduling policy
once the length of the history reaches the greed bound. For this scheduler, we show that
$Pr_{\bar{S}}^{\mathcal{M}}(t) \geq Pr_S^{\mathcal{M}}(t)$ holds true.
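
A greed bound can be computed and checked numerically. The sketch assumes the sufficient condition $n \geq 2\lambda t / \mu$, where $\mu$ is the discriminator: since $p_{\lambda t}(n+1) / p_{\lambda t}(n) = \lambda t / (n+1)$, this choice gives $\mu \cdot p_{\lambda t}(n) \geq 2\, p_{\lambda t}(n+1)$. The function names and the truncation of the tail sum are assumptions made for illustration.

```python
import math


def poisson(mean, n):
    """Poisson probability of exactly n events, evaluated in log space."""
    return math.exp(-mean + n * math.log(mean) - math.lgamma(n + 1))


def greed_bound(rate, t, discriminator):
    """Smallest natural number n with n >= 2 * rate * t / discriminator."""
    return math.ceil(2 * rate * t / discriminator)


def satisfies_greed_inequality(rate, t, discriminator, n, terms=200):
    """Check mu * p(n) >= sum_{i>=1} p(n + i), with the tail cut after `terms`."""
    mean = rate * t
    tail = sum(poisson(mean, n + i) for i in range(1, terms + 1))
    return discriminator * poisson(mean, n) >= tail
```

The check is only a numerical plausibility test: the tail is truncated, so it slightly underestimates the right-hand side of the inequality.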
Theorem 3.2 For uniform CTMDPs, there is an optimal scheduler for the classes CD and HD that
converges to the standard greedy scheduler after $n_{\mathcal{M}}$ steps.

Proof Let us consider any HD scheduler $S$ that makes a non-greedy decision after a time-abstract path
$\pi$ of length $|\pi| \geq n_{\mathcal{M}}$ with last location $l$. If the path ends in, or has previously
passed, the goal region, or if the probability of the history $\pi$ is 0, that is, if it cannot occur with
the scheduling policy of $S$, then we can change the decision of $S$ on every path starting with $\pi$
arbitrarily (and in particular to the standard greedy scheduler) without altering the reachability
probability.

If $PR_S^{\mathcal{M}}(\pi,t) > 0$, then we change the decisions of the scheduler $S$ for paths with prefix
$\pi$ such that they comply with the standard greedy scheduler. We call the resulting HD scheduler $S'$
and analyse the change in reachability probability using Equation (1):

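\[ Pr_{S'}^{\mathcal{M}}(t) - Pr_S^{\mathcal{M}}(t) = PR_S^{\mathcal{M}}(\pi,t) \cdot \sum_{i=0}^{\infty} \big(d_l[i] - d_{l,S_\pi}[i]\big) \cdot p_{\lambda t}(|\pi| + i), \]

where $S_\pi : \pi' \mapsto S(\pi \circ \pi')$ is the HD scheduler which prefixes its input with the path
$\pi$ and then calls the scheduler $S$. The greedy criterion implies $d_l > d_{l,S_\pi}$ with respect to
the lexicographic order, and after rewriting the upper equation:

\[ Pr_{S'}^{\mathcal{M}}(t) - Pr_S^{\mathcal{M}}(t) = PR_S^{\mathcal{M}}(\pi,t) \cdot \Big( \big(d_l[j] - d_{l,S_\pi}[j]\big) \cdot p_{\lambda t}(|\pi| + j) + \sum_{i=j+1}^{\infty} \big(d_l[i] - d_{l,S_\pi}[i]\big) \cdot p_{\lambda t}(|\pi| + i) \Big) \quad \text{(for some } j \geq 0\text{)} \]

where $j$ is the first position at which the two vectors differ, so that
$d_l[j] - d_{l,S_\pi}[j] \geq \mu$, we can apply Equation (2) to deduce that the difference
$Pr_{S'}^{\mathcal{M}}(t) - Pr_S^{\mathcal{M}}(t)$ is non-negative.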

Likewise, we can concurrently change the scheduling policy to the standard greedy scheduler for all
paths of length $\geq n_{\mathcal{M}}$ for which the scheduler $S$ makes non-greedy decisions. In this
way, we obtain a scheduler $S''$ that makes non-greedy decisions only in the first $n_{\mathcal{M}}$ steps,
and yields a (not necessarily strictly) better time-bounded reachability probability than $S$.

Since all greedy schedulers are interchangeable without changing the time-bounded reachability
probability (and even without altering the step probability vector), we can modify $S''$ such that it
follows the standard greedy scheduling policy after $\geq n_{\mathcal{M}}$ steps, resulting in a scheduler
$\bar{S}$ that comes with the same time-bounded reachability probability as $S''$. Note that $\bar{S}$ is
counting if $S$ is counting.

Hence, the supremum over the time-bounded reachability of all CD/HD schedulers is equivalent to
the supremum over the bounded reachability of CD/HD schedulers that deviate from the standard greedy
scheduler only in the first $n_{\mathcal{M}}$ steps. This class is finite, and the supremum over the
bounded reachability is therefore the maximal bounded reachability obtained by one of its
representatives. □

Hence, we have shown the existence of a simple optimal time-bounded CD scheduler. Using the
fact that the suprema over the time-bounded reachability probability coincide for CD, CR, HD, and HR
schedulers [2], we can infer that such a scheduler is optimal for all of these classes.

Corollary 3.3 $\max_{S \in CD} Pr_S^{\mathcal{M}}(t) = \max_{S \in HR} Pr_S^{\mathcal{M}}(t)$ holds for all
uniform CTMDPs $\mathcal{M}$. □

3.3 Non-uniform CTMDPs

Reasoning over non-uniform CTMDPs is harder than reasoning over uniform CTMDPs, because the
likelihood of seeing exactly $k$ steps does not adhere to the simple Poisson distribution, but depends on
the precise history. Even if two paths have the same length, they may imply different probability
distributions over the time passed so far. Knowing the time-abstract history therefore provides a
scheduler with more information about the system's state than merely its length. As a result, it is simple
to construct example CTMDPs for which history-dependent and counting schedulers can obtain different
time-bounded reachability probabilities [2].

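In this subsection, we extend the results from the previous subsection to general CTMDPs. We show
that simple optimal CD/HD schedulers exist, and that randomisation does not yield an advantage:

\[ \max_{S \in CD} Pr_S^{\mathcal{M}}(t) = \max_{S \in CR} Pr_S^{\mathcal{M}}(t) \quad \text{and} \quad \max_{S \in HD} Pr_S^{\mathcal{M}}(t) = \max_{S \in HR} Pr_S^{\mathcal{M}}(t). \]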

To obtain this result, we work on the uniformisation $\mathcal{U}$ of $\mathcal{M}$ instead of working on
$\mathcal{M}$ itself. We argue that the behaviour of a general CTMDP $\mathcal{M}$ can be viewed as the
observable behaviour of its uniformisation $\mathcal{U}$, using a scheduler that does not see the new
transitions and locations. Schedulers from this class can then be replaced by (or viewed as) schedulers
that do not use the additional information. And finally, we can approximate schedulers that do not use the
additional information by schedulers that do not use it initially, where initially means until the number
of visible steps (and hence in particular the number of steps) exceeds the greed bound $n_{\mathcal{U}}$
of the uniformisation $\mathcal{U}$ of $\mathcal{M}$. Comparable to the argument from the proof of
Theorem 3.2, we show that we can restrict our attention to the standard greedy scheduler after this initial
phase, which leads again to a situation where considering a finite class of schedulers suffices to obtain
the optimum.

Lemma 3.4 The greedy decisions and the step probability vector coincide for the observable and
unobservable copy of each location in the uniformisation $\mathcal{U}$ of any CTMDP $\mathcal{M}$.

Proof The observable and unobservable copy of each location reach the same successors under the same 
actions with the same transition rate. □ 

We can therefore choose a positional standard greedy scheduler whose decisions coincide for the 
observable and unobservable copy of each location. 

For the uniformisation $\mathcal{U}$ of a CTMDP $\mathcal{M}$, we define the function
$vis : Paths_{abs}(\mathcal{U}) \to Paths_{abs}(\mathcal{M})$ that maps a path $\pi$ of $\mathcal{U}$ to
the corresponding path in $\mathcal{M}$, the visible path, by deleting all unobservable locations and
their directly preceding transitions from $\pi$. (Note that all paths in $\mathcal{U}$ start in an
observable location.) We call a scheduler $n$-visible if its decisions only depend on the visible path and
coincide for the observable and unobservable copy of every location for all paths containing up to $n$
visible steps. We call a scheduler visible if it is $n$-visible for all $n \in \mathbb{N}$.
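
The projection onto the visible path is easy to state as code. In this sketch, a time-abstract path is a pair of a location list and an action list, and the unobservable copy of a location `l` is encoded as the pair `(l, "u")`; both encodings, and the helper name `is_unobservable`, are assumptions made for illustration.

```python
def is_unobservable(location):
    """Assumed encoding: the unobservable copy of l is the pair (l, "u")."""
    return isinstance(location, tuple) and len(location) == 2 and location[1] == "u"


def vis(path):
    """Delete all unobservable locations and their directly preceding transitions.

    path: (locations, actions) with len(locations) == len(actions) + 1,
          starting in an observable location.
    """
    locations, actions = path
    visible_locations = [locations[0]]
    visible_actions = []
    for action, location in zip(actions, locations[1:]):
        if is_unobservable(location):
            continue  # drop the location together with the transition into it
        visible_actions.append(action)
        visible_locations.append(location)
    return visible_locations, visible_actions
```

A visible scheduler is then any scheduler whose decision on a path depends only on the result of `vis` applied to that path.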

We call an HD/HR scheduler an ($n$-)visible HD/HR scheduler if it is ($n$-)visible, and we call an
($n$-)visible HD/HR scheduler a visible CD/CR scheduler if its decisions depend only on the length of
the visible path, and an $n$-visible CD/CR scheduler if its decisions depend only on the length of the
visible path for all paths containing up to $n$ visible steps. The respective classes are denoted with
according prefixes, for example, $n$-vCD. Note that ($n$-)visible counting schedulers are not counting.

It is a simple observation that we can study visible CD, CR, HD, and HR schedulers on the
uniformisation $\mathcal{U}$ of a CTMDP $\mathcal{M}$ instead of studying CD, CR, HD, and HR schedulers on
$\mathcal{M}$.

Lemma 3.5 $S \mapsto S \circ vis$ is a bijection from CD, CR, HD, or HR schedulers of a CTMDP
$\mathcal{M}$ onto visible CD, CR, HD, or HR schedulers, respectively, for the uniformisation
$\mathcal{U}$ of $\mathcal{M}$ that preserves the time-bounded reachability probability:
$Pr_S^{\mathcal{M}}(t) = Pr_{S \circ vis}^{\mathcal{U}}(t)$. □

At the same time, copying the argument from the proof of Theorem 3.2, an $n_{\mathcal{U}}$-visible CD or
HD scheduler $S$ can be adjusted to the $n_{\mathcal{U}}$-visible CD or HD scheduler $\bar{S}$ that
deviates from $S$ only in that it complies with the standard greedy scheduler for $\mathcal{U}$ after
$n_{\mathcal{U}}$ visible steps, without decreasing the time-bounded reachability probability. These
schedulers are visible schedulers from a finite sub-class, and hence some representative of this class
takes the optimal value. We can, therefore, construct optimal CD and HD schedulers for every CTMDP
$\mathcal{M}$.

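Lemma 3.6 The following equations hold for the uniformisation $\mathcal{U}$ of a CTMDP $\mathcal{M}$:

\[ \max_{S \in n_{\mathcal{U}}\text{-}vCD} Pr_S^{\mathcal{U}}(t) = \max_{S \in vCD} Pr_S^{\mathcal{U}}(t) \quad \text{and} \quad \max_{S \in n_{\mathcal{U}}\text{-}vHD} Pr_S^{\mathcal{U}}(t) = \max_{S \in vHD} Pr_S^{\mathcal{U}}(t). \]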



Proof We have shown in Theorem 3.2 that turning to the standard greedy scheduling policy after
$n_{\mathcal{U}}$ or more steps can only increase the time-bounded reachability probability. This implies
that we can turn to the standard greedy scheduler after $n_{\mathcal{U}}$ visible steps.

The scheduler resulting from this adjustment does not only remain $n_{\mathcal{U}}$-visible, it becomes a
visible CD and HD scheduler, respectively. Moreover, it is a scheduler from the finite subset of CD or HD
schedulers, respectively, whose behaviour may only deviate from the standard scheduler within the first
$n_{\mathcal{U}}$ visible steps. □

To prove that optimal CD and HD schedulers are also optimal CR and HR schedulers, respectively,
we first prove the simpler lemma that this holds for $k$-bounded reachability.

Lemma 3.7 k-optimal CD or HD schedulers are also k-optimal CR or HR schedulers, respectively. 

Proof For a CTMDP $\mathcal{M}$ we can turn an arbitrary CR or HR scheduler $S$ into a CD or HD scheduler
$S'$ with a time and $k$-bounded reachability probability that is at least as good as the one of $S$ by
first determinising the scheduler decisions from the $(k+1)$st step onwards (this obviously has no impact
on $k$-bounded reachability) and then determinising the remaining randomised choices.

Replacing a single randomised decision on a path $\pi$ (for history-dependent schedulers) or on a set
of paths $\Pi$ (for counting schedulers) that end(s) in a location $l$ is safe, because the time and
$k$-bounded reachability probability of a scheduler is an affine combination (the affine combination
defined by $S(\pi)$ and $S(|\pi|, l)$, respectively) of the $|Act(l)|$ schedulers resulting from
determinising this single decision. Hence, we can pick one of them whose time and $k$-bounded
reachability probability is at least as high as the one of $S$.

As the number of these randomised decisions is finite ($\leq k\,|L|$ for CR schedulers, and at most the
number of time-abstract paths with fewer than $k$ steps for HR schedulers), this results in a
deterministic scheduler after a finite number of improvement steps. □
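
The safety of determinising a single decision can be spelled out as a short calculation. This is a sketch for the history-dependent case and is not taken from the source: the notation $S_a$, for the scheduler that agrees with $S$ everywhere except that it chooses action $a$ with probability 1 on the path $\pi$, is an assumption introduced here.

```latex
\[
  Pr_S^{\mathcal{M}}(t;k)
  \;=\; \sum_{a \in Act(l)} S(\pi)(a) \cdot Pr_{S_a}^{\mathcal{M}}(t;k)
  \;\leq\; \max_{a \in Act(l)} Pr_{S_a}^{\mathcal{M}}(t;k)
\]
```

The inequality holds because the weights $S(\pi)(a)$ are non-negative and sum to 1, so the weighted average cannot exceed the largest of the combined values.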

Theorem 3.8 Optimal CD schedulers are also optimal CR schedulers. 

Proof First, for $n \to \infty$ the probability to reach the goal region $B$ in exactly $n$ or more than
$n$ steps converges to 0, independent of the scheduler. Together with Lemma 3.7, this implies

\[ \sup_{S \in CR} Pr_S^{\mathcal{M}}(t) = \lim_{n \to \infty} \sup_{S \in CR} Pr_S^{\mathcal{M}}(t;n) = \lim_{n \to \infty} \sup_{S \in CD} Pr_S^{\mathcal{M}}(t;n) \leq \max_{S \in CD} Pr_S^{\mathcal{M}}(t), \]

where equality is implied by $CD \subseteq CR$. □

Analogously, we can prove the similar theorem for history-dependent schedulers: 
Theorem 3.9 Optimal HD schedulers are also optimal HR schedulers. □ 



3.4 Constructing Optimal Schedulers 

The proof of the existence of an optimal scheduler is not constructive in two aspects. First, the
computation of a positional greedy scheduler requires a bound for $k$, which indicates the maximal depth
up to which we have to compare the step probability vectors before we can ascertain equality. Second, we
need an exact method to compare the quality of two (arbitrary) schedulers.
A bound for $k$. The first property is captured in the following lemma. Without this lemma, we could
only provide an algorithm that is guaranteed to converge to an optimal scheduler, but would be unable
to determine whether an optimal solution has already been reached, as we never know when to stop
when comparing step probability vectors. In this lemma, however, we show that it suffices to check for
equivalence of two step probability vectors only up to position $|L| - 2$. As discussed in Subsection 3.1,
this enables us to identify greedy actions and thus to compute the discriminator $\mu$ and consequently
the greed bound.

Lemma 3.10 Given a uniform CTMDP $\mathcal{M}$, the smallest $k$ that satisfies 
$\forall l \in L,\ a \in Act(l).\ d_l \neq d_{l,a} \Rightarrow \exists k' \le k.\ d_l[k'] > d_{l,a}[k']$ is bounded by $|L| - 1$. 

Proof The techniques we exploit in this proof draw from linear algebra, and are, while simple, a bit 
unusual in this context. We first turn to the simpler notion of Markov chains by resolving the 
nondeterminism in accordance with the positional standard greedy scheduler $S$ whose existence was shown in 
Subsection 3.1. 

We first lift the step probability vector from locations to distributions, where $d_\nu = \sum_{l \in L} \nu(l)\, d_l$ is, 
for a distribution $\nu : L \to [0,1]$, the affine combination of the step probability vectors of the individual 
locations. In this proof, we define two distributions $\nu, \nu' : L \to [0,1]$ to be equivalent if their step 
probability vectors are equal, $d_\nu = d_{\nu'}$. Further, we call them $i$-step equivalent if their step probability 
vectors are equal up to position $i$: 

$$\forall j \le i.\ d_\nu[j] = d_{\nu'}[j].$$

In order to argue with vector spaces, we extend these definitions to arbitrary vectors $\nu : L \to \mathbb{R}$ 
(instead of $\nu : L \to [0,1]$). 

Let $D_i$ be the vector space spanned by the differences $\nu - \nu'$ of $i$-step equivalent distributions $\nu, \nu'$ 
over $L$. Naturally, $D_i \supseteq D_{i+1}$ always holds, as $(i+1)$-step equivalence implies $i$-step equivalence. In 
addition, we show that $D_0$ has $|L| - 2$ dimensions, and that $D_i = D_{i+1}$ implies that a fixed point is 
reached, which together implies that $D_{|L|-2} = D_j$ for all $j \ge |L| - 2$. 

• $D_0$ has $|L| - 2$ dimensions: $D_0$ is the vector space that contains the multiples of differences 
$\delta = \lambda(\nu - \nu')$ of distributions $\nu, \nu' : L \to [0,1]$ that are equally likely in the goal region (due to 
0-step equivalence; $d_\nu[0] = d_{\nu'}[0]$). 

The fact that $\nu$ and $\nu'$ are distributions implies $\sum_{l \in L} \nu(l) = 1$ and $\sum_{l \in L} \nu'(l) = 1$, and hence 
$\sum_{l \in L} \delta(l) = 0$. Further, the fact that $\nu$ and $\nu'$ are equally likely in the goal region implies 
$\sum_{l \in B} \nu(l) = \sum_{l \in B} \nu'(l)$, and hence $\sum_{l \in B} \delta(l) = 0$. Thus, $D_0$ has $|L| - 2$ dimensions. (Assuming 
$B \neq L$, $B \neq \emptyset$, but otherwise every scheduler has equal quality.) 

• Once we have constructed $D_i$, we can construct the vector space $O_i$ that contains a vector $\delta$ if it is 
a multiple $\delta = \lambda(\nu - \nu')$ of a difference $\nu - \nu'$ of distributions such that $shift(d_\nu)$ and $shift(d_{\nu'})$ 
are $i$-step equivalent, that is, $shift(d_\nu) - shift(d_{\nu'}) \in D_i$. 

The transition from step probability vectors to the shift of them is a simple linear operation, which 
transforms the distributions according to the transition matrix of the embedded DTMC. Hence, we 
can obtain $O_i$ from $D_i$ by a simple linear transformation of the vector space. 

• Two step probability vectors are $(i+1)$-step equivalent if (1) they are $i$-step equivalent, and (2) their 
shifts are $i$-step equivalent. Therefore $D_{i+1} = D_i \cap O_i$ can be obtained by an intersection of the two 
vector spaces $D_i$ and $O_i$. 

Naturally, this implies that the vector spaces are shrinking, that is, $D_0 \supseteq D_1 \supseteq \ldots \supseteq D_{|L|-2} \supseteq \ldots$, and 
that $D_i = D_{i+1}$ implies that a fixed point is reached. (It implies $O_i = O_{i+1}$ and hence $D_i = D_j$ ($\forall j \ge i$) by 
a simple inductive argument.) 






As $D_0$ is an $(|L| - 2)$-dimensional vector space, and inequality ($D_i \neq D_{i+1}$) implies the loss of at 
least one dimension, a fixed point is reached after at most $|L| - 2$ steps. That is, two distributions are 
equivalent if, and only if, they are $(|L| - 2)$-step equivalent. 

Having established this, we apply it on the distribution $\nu_{l,a}$ obtained in one step from a position 
$l \notin B$ when choosing the action $a$, as compared to the distribution $\nu_l$ obtained when choosing the action 
according to the positional greedy scheduler. 

Now, $d_l > d_{l,a}$ holds if, and only if, $shift(d_l) = d_{\nu_l} > d_{\nu_{l,a}} = shift(d_{l,a})$, which implies $d_{\nu_l}[k'] > d_{\nu_{l,a}}[k']$ 
for some $k' \le |L| - 2$, and hence $d_l[k] > d_{l,a}[k]$ for some $k < |L|$. □ 
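
The dimension argument of this proof can be checked on a small chain. The sketch below is an illustration under assumptions, not the construction from the proof: it takes $d_\nu[j]$ to be the probability mass in the absorbing goal region after $j$ steps of the embedded DTMC, so that $D_i$ is the set of vectors $\delta$ with $\sum_l \delta(l) = 0$ and $\delta \cdot (P^j 1_B) = 0$ for all $j \le i$. The helper names `rank` and `dimensions` and the four-location transition matrix `P` are hypothetical.

```python
from fractions import Fraction as F

def rank(rows):
    """Rank of a list of row vectors over the rationals (Gaussian elimination)."""
    rows = [list(r) for r in rows]
    rk = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(len(rows)):
            if i != rk and rows[i][col] != 0:
                factor = rows[i][col] / rows[rk][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[rk])]
        rk += 1
    return rk

def dimensions(P, goal):
    """Return [dim D_0, dim D_1, ...] up to the first index where the chain is stable."""
    n = len(P)
    ones = [F(1)] * n                                   # delta sums to zero
    g = [F(1) if l in goal else F(0) for l in range(n)]  # indicator of B
    constraints = [ones, g]
    dims = [n - rank(constraints)]
    while True:
        # g becomes P^j 1_B: probability of being in B after j steps.
        g = [sum(P[l][m] * g[m] for m in range(n)) for l in range(n)]
        constraints.append(g)
        d = n - rank(constraints)
        if d == dims[-1]:
            return dims
        dims.append(d)

# Hypothetical embedded DTMC with locations 0..3; location 3 is the absorbing goal.
P = [
    [F(1, 2), F(0), F(0), F(1, 2)],
    [F(0), F(3, 4), F(0), F(1, 4)],
    [F(0), F(0), F(1), F(0)],
    [F(0), F(0), F(0), F(1)],
]
dims = dimensions(P, goal={3})
```

For this chain the dimensions shrink by one in every step and the fixed point is reached at index $|L| - 2 = 2$, in line with the bound derived above.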

Comparing schedulers So far, we have narrowed down the set of candidates for the optimal scheduler 
to a finite number of schedulers. To determine the optimal scheduler, it now suffices to have a comparison 
method for their reachability probabilities. 

The combination of each of these schedulers with the respective CTMDP can be viewed as a finite 
continuous-time Markov chain (CTMC), since they behave like a positional scheduler after $n_{\mathcal{M}}$ steps. 
Aziz et al. [1] have shown that the time-bounded reachability probabilities of CTMCs are computable 
(and comparable) finite sums $\sum_{i \in I} \eta_i e^{\delta_i}$, where the individual $\eta_i$ and $\delta_i$ are algebraic numbers. 

We conclude with a constructive extension of our results: 

Corollary 3.11 We can effectively construct optimal CD, CR, HD, and HR schedulers. □ 

Corollary 3.12 We can compute the time-bounded reachability probability of optimal schedulers as 
finite sums $\sum_{i \in I} \eta_i e^{\delta_i}$, where the $\eta_i$ and $\delta_i$ are algebraic numbers. □ 

Complexity 

These corollaries rely on the precise CTMC model checking approach of Aziz et al. [1], which only 
demonstrates the effective decidability of this problem. We deem it unlikely that a complexity bound for 
finding optimal strategies can be provided prior to determining the respective CTMC model checking 
complexity. 

3.5 Example 

To exemplify our proposed construction, let us consider the example CTMDP $\mathcal{M}$ depicted in Figure 1. 
As $\mathcal{M}$ is not uniform, we start with constructing the uniformisation $\mathcal{U}$ of $\mathcal{M}$ (cf. Figure 1). 

$\mathcal{U}$ has the uniform transition rate $\lambda = 6$. Independent of the initial distribution of $\mathcal{M}$, the unobservable 
copies of $l_1$ and $l_2$ are not reachable in $\mathcal{U}$, because the initial distribution of a uniformisation assigns all 
probability weight to observable locations, and the transition rate of all enabled actions in $l_1$ and $l_2$ 
in $\mathcal{M}$ is already $\lambda$. (Unobservable copies of a location $l$ are only reachable from the observable and 
unobservable copy of $l$ upon enabled actions $a$ with non-maximal exit rate $R(l,a,L) \neq \lambda$.) 

Disregarding the unreachable part of $\mathcal{U}$, there are only 8 positional schedulers for $\mathcal{U}$, and only 4 of 
them are visible (that is, coincide on $l_0$ and $l_{\mathcal{U},0}$). They can be characterised by $S_1 = \{l_0 \mapsto a, l_1 \mapsto a\}$, 
$S_2 = \{l_0 \mapsto a, l_1 \mapsto b\}$, $S_3 = \{l_0 \mapsto b, l_1 \mapsto a\}$, and $S_4 = \{l_0 \mapsto b, l_1 \mapsto b\}$. In order to determine a 
greedy scheduler, we first determine step probability vectors: 

For lo- d[ Q j l dl Q j 2 ( 5 ' 9 ' 27 '•'•)' ^'0.^3 (2'l2'72'' - ')' ^'0,^4 (2'2'4'' -- )' 

For l\\ di l , S[ = di uSj = (g, j(j,jYE,- • • ), di uSl = (0, 5, §,...)> ^Hm = (0; \, 2> • • • )■ 
Figure 1: The example CTMDP $\mathcal{M}$ (left) and the reachable part of its uniformisation $\mathcal{U}$ (right). 

Note that, in the given example, it suffices to compute the step probability vector for a single step to 
determine that $S_3$ is optimal (w.r.t. the greedy optimality criterion); in general, it suffices to consider as 
many steps as the CTMDP has locations. Since deviating from $S_3$ decreases the chance to reach the goal 
location $l_2$ in a single step by $\frac{1}{6}$ both from $l_0$ and $l_1$, the discriminator $\mu = \frac{1}{6}$ is easy to compute. 
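
The comparison of step probability vectors can be sketched in a few lines. The one-step distributions in `STEP` below are assumed for illustration only and are not read off Figure 1; unobservable copies are merged with their observable location, which is harmless for visible schedulers. The step vector is taken here to list the probability of entering the goal in exactly 1, 2, 3, ... steps, and all names (`STEP`, `step_vector`, `greedy`) are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

GOAL = "l2"
LOCATIONS = ("l0", "l1")
# Hypothetical one-step distributions of a uniform model (illustration only).
STEP = {
    ("l0", "a"): {"l0": F(2, 3), "l2": F(1, 3)},
    ("l0", "b"): {"l1": F(1, 2), "l2": F(1, 2)},
    ("l1", "a"): {"l1": F(5, 6), "l2": F(1, 6)},
    ("l1", "b"): {"l0": F(1)},
}

def step_vector(start, scheduler, length=3):
    """Probabilities of entering GOAL in exactly 1, 2, ..., `length` steps."""
    dist = {start: F(1)}
    vector = []
    for _ in range(length):
        nxt = {}
        for loc, p in dist.items():
            for succ, q in STEP[(loc, scheduler[loc])].items():
                nxt[succ] = nxt.get(succ, F(0)) + p * q
        vector.append(nxt.pop(GOAL, F(0)))  # mass that has just reached the goal
        dist = nxt                           # the remaining mass keeps moving
    return tuple(vector)

# The four positional schedulers, in the order S1, S2, S3, S4.
schedulers = [dict(zip(LOCATIONS, acts)) for acts in product("ab", repeat=2)]

# A greedy scheduler has the lexicographically largest vector in every location.
greedy = next(
    s for s in schedulers
    if all(step_vector(l, s) >= step_vector(l, o)
           for o in schedulers for l in LOCATIONS)
)
# Smallest first-step loss incurred by deviating in a single location.
one_step_gap = min(
    step_vector(l, greedy)[0] - step_vector(l, o)[0]
    for o in schedulers for l in LOCATIONS
    if o[l] != greedy[l]
)
```

With these assumed numbers, the first entry of each vector already separates the greedy scheduler from its competitors, so comparing deeper positions is not needed here.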

Our coarse estimation provides a greed bound of $n_{\mathcal{U}} = \lceil 72 \cdot t \rceil$, where $t$ is the time bound, but 
$n_{\mathcal{U}} = \lceil 42 \cdot t \rceil$ suffices to satisfy Equation (2). 

When seeking optimal schedulers from any of the discussed classes, we can focus on the finite set 
of those schedulers that comply with $S_3$ after $n_{\mathcal{U}}$ (visible) steps. In the previous subsection, we described 
how the precise model checking technique of Aziz et al. [1] can be exploited to turn the existence proof 
into an effective technique for the construction of optimal schedulers. 



4 Extension to Continuous-Time Markov Games 

Markov decision processes can easily be extended to continuous-time Markov games (CTGs) 
$\mathcal{G} = (L_A, L_D, Act, R, \nu, B)$ by partitioning the set of locations into game positions of a maximiser ($L_A$, 
angelic game positions) and a minimiser ($L_D$, demonic game positions). These two players have antagonistic 
objectives to maximise and minimise the time-bounded reachability probability. These games are closely 
related to the CTMDP framework, and we define, for a given Markov game $\mathcal{G}$, the underlying CTMDP 
$\mathcal{M} = (L_A \cup L_D, Act, R, \nu, B)$. CTGs are called uniform if their underlying CTMDP is uniform. 

The players can choose an action upon the entrance to one of their locations, and, as with 
schedulers for CTMDPs, they may have limited access to the timed history of the system. We only consider 
time-abstract strategies $S_X : Paths^{abs}_X(\mathcal{G}) \to Dist(Act)$ for both players, where paths are defined over the 
underlying CTMDP, and $Paths^{abs}_X(\mathcal{G})$ (for $X \in \{A, D\}$) is the set of paths that end with a location in $L_X$. 

Obviously, there is a one-to-one mapping between combined strategies 

$$S_{A+D}(\pi) = \begin{cases} S_A(\pi) & \text{if } \pi \in Paths^{abs}_A(\mathcal{G}), \\ S_D(\pi) & \text{if } \pi \in Paths^{abs}_D(\mathcal{G}) \end{cases}$$

of a CTG and schedulers of the underlying CTMDP. 
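
The combined strategy is a plain case distinction on the owner of the last location of the path. The following minimal sketch assumes that a time-abstract path is represented as a tuple of location names and that a strategy returns a distribution over actions as a dictionary; the function `combine` and the sample strategies are hypothetical.

```python
def combine(strategy_a, strategy_d, angelic, demonic):
    """Merge both players' strategies into one scheduler of the underlying CTMDP."""
    def scheduler(path):
        last = path[-1]
        if last in angelic:        # path ends in a maximiser location
            return strategy_a(path)
        if last in demonic:        # path ends in a minimiser location
            return strategy_d(path)
        raise ValueError(f"unknown location {last!r}")
    return scheduler

angelic = {"l0"}
demonic = {"l1"}

def s_a(path):
    return {"a": 1.0}              # deterministic choice of the maximiser

def s_d(path):
    return {"b": 0.5, "c": 0.5}    # randomised choice of the minimiser

s_combined = combine(s_a, s_d, angelic, demonic)
```

Because $L_A$ and $L_D$ are disjoint, exactly one branch applies to every path, which is why the mapping is one-to-one.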

For a given CTG and a pair of strategies $S_A$, $S_D$ we define the according probability space equivalent 
to the probability space of the underlying CTMDP with the combined strategy $S_{A+D}$. Then, the 
time-bounded reachability probability can be formulated for CTGs as follows: 

$$\sup_{S_A} \inf_{S_D} Pr^{\mathcal{G}}_{S_{A+D}}(t) = \inf_{S_D} \sup_{S_A} Pr^{\mathcal{G}}_{S_{A+D}}(t) \qquad (3)$$

where equality is guaranteed by [3, Theorem 3.1]. 

For uniform CTGs, a theorem similar to Theorem 3.2 has recently been shown: 

Theorem 4.1 [3] For uniform CTGs $\mathcal{G}$ with counting strategies, we can compute a bound $n_{\mathcal{G}}$ (comparable 
to our greed bound) and a memoryless deterministic greedy strategy $S : L \to Act$, such that following 
$S$ is optimal for both players after $n_{\mathcal{G}}$ steps. 

That is, optimal (counting) strategies for uniform Markov games have a similarly simple structure as 
those for CTMDPs. Now, we extend these results to history-dependent (HD and HR) schedulers: 

Theorem 4.2 The optimal CD strategies from Theorem 4.1 (that is, for uniform CTGs) are also optimal 
HR strategies. 

Proof Let us assume the minimiser plays in accordance with her optimal CD strategy. Let us further 
assume that the maximiser has an HR strategy that yields a better result than his CD strategy. Then it 
must improve over his optimal CD strategy by a margin of some $\varepsilon$. 

Let us define $p(k,l)$ as the maximum of the probabilities to still reach the goal region in the future 
that the maximiser can reach under the paths of length $k$ which end in location $l$ with the better history 
dependent strategy. Further, let $h_l(k)$ be a path where this optimal value is taken. (Note that our goal 
region is absorbing.) The decision this HR scheduler takes is an affine combination of deterministic 
decisions, and the quality (the probability of reaching the goal region in the future) is the respective 
affine combination of the outcome of these pure decisions. Hence, there is at least one pure decision that 
(not necessarily strictly) improves over the randomised decision. 

As our CTG is uniform, we can improve this history dependent scheduler by changing all decisions 
it makes on a path $\pi = \pi_l \circ \pi'$ that starts with a path $\pi_l$ of length 2 ending in a location $l$ to the decisions 
it made upon the path $h_l(2) \circ \pi'$. (The improvement is not necessarily strict.) We then improve it further 
(again not necessarily strictly) by turning to the improved pure decision. The resulting strategy is initially 
counting (it depends only on the length of the history and the current location) and deterministic for 
paths up to length 2. 

Having constructed a history dependent scheduler that is initially counting and deterministic for paths 
up to length $k$, we repeat this step for paths $\pi = \pi_l \circ \pi'$ that start with a history $\pi_l$ of length $k+1$, where 
we replace the decision made by our initially $k$ counting and deterministic scheduler by the decision 
made on $h_l(k+1) \circ \pi'$, and then further by its deterministic improvement. This again leads to a (not 
necessarily strict) improvement. 

Once the probability of making at least $k$ steps falls below $\varepsilon$, any deterministic counting scheduler 
that agrees on the first $k$ steps with a history dependent scheduler from this sequence (which is initially 
counting and deterministic for at least $k$ steps) improves over the counting scheduler we started with for 
the maximiser, which contradicts its optimality. 

A similar argument can be made for the minimiser. □ 

Our argument that infers the existence of optimal strategies for general CTMDPs from the existence 
of optimal strategies for uniform CTMDPs does not depend on the fact that we have only one player with 
a particular objective. In fact, it can be lifted easily to Markov games. 

Theorem 4.3 For a Markov game $\mathcal{G}$, we can effectively construct optimal CD and HD schedulers, which 
are also optimal CR and HR schedulers, respectively, and we can compute the time-bounded reachability 
probability of optimal schedulers as finite sums $\sum_{i \in I} \eta_i e^{\delta_i}$, where the $\eta_i$ and $\delta_i$ are algebraic numbers. 

Proof sketch We start again with the uniformisation $\mathcal{U}$ of the Markov game $\mathcal{G}$. By Theorem 4.1 there 
is a deterministic memoryless greedy strategy for both players in $\mathcal{U}$ that is optimal after $n_{\mathcal{U}}$ steps. Hence, 
we can argue along the same lines as for CTMDPs: 

• We study the visible strategies on the uniformisation $\mathcal{U}$ of $\mathcal{G}$. Like in the constructions from 
Section 3.3, we use a bijection $vis$ from the visible strategies on $\mathcal{U}$ onto the strategies of $\mathcal{G}$, which 
preserves the time-bounded reachability. 

• We define $n_{\mathcal{U}}$-visible strategies analogously to the $n_{\mathcal{U}}$-visible schedulers to be those strategies 
which can use the additional information provided by $\mathcal{U}$ after $n_{\mathcal{U}}$ visible steps have passed. 

After $n_{\mathcal{U}}$ visible steps, the class of $n_{\mathcal{U}}$-visible strategies clearly contains the deterministic 
greedy strategies described in the previous theorems of this section, as they can use all 
information after step $n_{\mathcal{U}}$. Using Theorem 4.1 we can deduce that, for both players, it suffices to seek 
an optimal $n_{\mathcal{U}}$-visible strategy in the subset of those strategies that turn to the standard greedy 
strategy after $n_{\mathcal{U}}$ visible steps. 

• Locations $l$ and their counterparts $l_{\mathcal{U}}$ have exactly the same exit rates for all actions, and therefore 
a greedy-optimal memoryless strategy will pick the same action for both locations (up to equal 
quality of actions). This directly implies that the standard greedy scheduler is a visible strategy, 
and with it all $n_{\mathcal{U}}$-visible strategies that turn to the standard greedy strategy after $n_{\mathcal{U}}$ visible steps 
are visible strategies. Hence, an optimal strategy for the class of $n_{\mathcal{U}}$-visible strategies that turn to 
the standard greedy strategy after $n_{\mathcal{U}}$ visible steps is also optimal for the class of visible strategies 
(time-abstract strategies in $\mathcal{G}$, respectively). 

• For deterministic strategies, this class is finite, which immediately implies the existence of an 
optimum in this class (using Equation (3)). 

Randomised strategies again cannot provide an advantage over deterministic ones, because their 
outcome is just an affine combination of the outcome of the respective pure strategies, and the extreme 
points are taken at the fringe. (Technically, we can start with any randomised strategy and replace one 
randomised decision after another by a pure counterpart, improving the quality of the outcome — not 
necessarily strictly — for the respective player.) 

Consequently, we are left with a finite set of history dependent or counting candidate strategies, 
respectively, and the result can — at least in principle — be found by applying a brute force approach: For 
each of these deterministic strategies, we can compute the reachability probability using the algorithm 
of Aziz et al. [1], which allows for identifying the deterministic strategies that mark an optimal Nash 
equilibrium. □ 
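
The brute-force step over the finite candidate sets can be sketched as follows. This is only an outline under assumptions: `value` stands in for the exact reachability computation of Aziz et al. [1], which is not implemented here, and the candidate names and table entries are hypothetical.

```python
from fractions import Fraction as F

def equilibrium(candidates_a, candidates_d, value):
    """Return (max-min value, min-max value) over finite sets of pure strategies."""
    maxmin = max(min(value(a, d) for d in candidates_d) for a in candidates_a)
    minmax = min(max(value(a, d) for a in candidates_a) for d in candidates_d)
    return maxmin, minmax

# Hypothetical reachability probabilities for two candidates per player.
TABLE = {
    ("a1", "d1"): F(1, 2), ("a1", "d2"): F(3, 5),
    ("a2", "d1"): F(2, 5), ("a2", "d2"): F(7, 10),
}

def value(a, d):
    return TABLE[(a, d)]  # placeholder for the exact CTMC computation

maxmin, minmax = equilibrium(["a1", "a2"], ["d1", "d2"], value)
# Pairs of pure strategies whose outcome equals the common value.
optimal_pairs = [pair for pair, v in TABLE.items() if v == maxmin == minmax]
```

In this toy table both values coincide, as Equation (3) guarantees for the game setting. Note that a pair attaining the common value is only a candidate; it is an equilibrium if neither player can improve by deviating unilaterally.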